Abstract

The primitive data for deducing the Miyazawa-Jernigan contact energy or the BLOSUM score matrix
are the pair frequency counts. Each amino acid corresponds to a distribution. Taking the Kullback-Leibler
distance between two probability distributions as the resemblance coefficient and relating a cluster to a mixed
population, we perform cluster analysis of amino acids based on the frequency count data. Furthermore,
Ward's clustering is also obtained by adopting the average score as an objective function. An ordinal
cophenetic matrix is introduced to compare results from different clustering methods.

Introduction 

Experimental investigation has strongly suggested that protein folding can be achieved with fewer letters
than the 20 naturally occurring amino acids (Chan, 1999; Plaxco et al., 1998). The native structure and
physical properties of the protein Rop are maintained when its 32-residue hydrophobic core is formed with only
Ala and Leu residues (Munson et al., 1994). Another example is the five-letter alphabet of Baker's group for
38 out of 40 selected sites of the SH3 chain (Riddle et al., 1997). The mutational tolerance can be high in many
regions of protein sequences. Heterogeneity or diversity in interaction must be present for polypeptides to
have protein-like properties. However, the physics and chemistry of a polypeptide chain consisting of fewer than
20 letters may be sufficiently simplified for a thorough understanding of protein folding.

A central task of protein sequence analysis is to uncover the exact nature of the information encoded in
the primary structure. We still cannot read the language describing the final 3D fold of an active biological
macromolecule. Compared with a DNA sequence, a protein sequence is generally much shorter, but the size of
its alphabet is five times larger. A proper coarse graining of the 20 amino acids into fewer clusters is important
for improving the signal-to-noise ratio when extracting information by statistical means.

Based on Miyazawa-Jernigan's (MJ) residue-residue statistical potential (Miyazawa and Jernigan, 1996),
Wang and Wang (1999) (WW) reduced the alphabet. They introduced a 'minimal mismatch' principle to
ensure that all interactions between amino acids belonging to any two given groups are as similar to one
another as possible. The knowledge-based MJ potential is derived from the frequencies of contacts between
different amino acid residues in a database of known native protein structures. Murphy, Wallqvist and
Levy (2000) (MWL) approached the same problem using the BLOSUM matrix derived by Henikoff and
Henikoff (1992). The matrix is deduced from amino acid pair frequencies in aligned blocks of a protein
sequence database, and is widely used for sequence alignment and comparison.

The problem of alphabet reduction may be viewed as cluster analysis, which is a well-developed topic
(Romesburg, 1984; Spath, 1985). WW used the mismatch as an objective function without any resemblance
measure. MWL adopted a cosine-like resemblance coefficient (with a non-standard normalization) from the
BLOSUM score matrix without any objective function, and took the arithmetic mean of scores to define
the cluster center. It is our purpose to propose an entropic algorithm for selecting a reduced alphabet in a
consistent and systematic way.
Materials and methods 

Both the MJ contact energies and the BLOSUM score matrices are deduced from the primitive frequency
counts of amino acid pairs. Taking the BLOSUM matrix as an example for specificity and following Henikoff and
Henikoff (1992), we denote the total number of amino acid i, j pairs (1 ≤ i, j ≤ 20) by $f_{ij}$. It is convenient
to introduce another set $f'_{ij}$, with $f'_{ij} = f_{ij}/2$ for i ≠ j and $f'_{ii} = f_{ii}$, which defines a joint probability for
each i, j pair

\[ q'_{ij} = f'_{ij} \Big/ \sum_{i=1}^{20} \sum_{j=1}^{20} f'_{ij}. \qquad (1) \]

The probability for the amino acid i to occur is then 

\[ P_i = \sum_{j=1}^{20} q'_{ij}. \qquad (2) \]

The BLOSUM score corresponds to the logarithm of odds 

\[ S_{ij} = \log_2 [ q'_{ij} / (P_i P_j) ]. \qquad (3) \]

Each amino acid i may be described by the conditional probability vector $\{p(j|i)\}_{j=1}^{20}$ with $p(j|i) = q'_{ij}/P_i$.
In the language of cluster analysis, the objects are the 20 amino acids, and the attributes are the $p(j|i)$.
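The chain from counts to attributes in Eqs. (1)-(3) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are ours, the input is taken to be a symmetric matrix of unordered pair counts, and the 3-letter count table is hypothetical toy data, not BLOSUM or MJ counts.

```python
import math

def joint_probabilities(f):
    """Eq. (1): joint probabilities q'[i][j] from symmetric pair counts f[i][j]."""
    n = len(f)
    # f'_{ij} = f_{ij}/2 off the diagonal and f'_{ii} = f_{ii} on it
    fp = [[f[i][j] if i == j else f[i][j] / 2 for j in range(n)] for i in range(n)]
    total = sum(sum(row) for row in fp)
    return [[x / total for x in row] for row in fp]

def marginals(q):
    """Eq. (2): P_i = sum_j q'_{ij}."""
    return [sum(row) for row in q]

def log_odds(q, P):
    """Eq. (3): S_{ij} = log2[q'_{ij} / (P_i P_j)]; assumes every q'_{ij} > 0."""
    n = len(q)
    return [[math.log2(q[i][j] / (P[i] * P[j])) for j in range(n)] for i in range(n)]

def conditionals(q, P):
    """p(j|i) = q'_{ij} / P_i; row i is the attribute vector of letter i."""
    return [[x / P[i] for x in row] for i, row in enumerate(q)]

# Hypothetical 3-letter toy counts (assumption, for illustration only).
toy_counts = [
    [40, 10, 10],
    [10, 20, 20],
    [10, 20, 10],
]
q = joint_probabilities(toy_counts)
P = marginals(q)
s = log_odds(q, P)
cond = conditionals(q, P)
```

In this toy table the first letter pairs with itself more often than chance, so its diagonal score is positive, while its score with the second letter is negative.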

A measure of the similarity between two distributions $\{p_i\}$ and $\{q_i\}$ is the Kullback-Leibler distance
D (also called relative entropy) of the probability distribution q from p (Kullback, 1959; Kullback et al.,
1987; Sakamoto et al., 1986):

\[ D(p, q) = \sum_i p_i \log(p_i/q_i). \qquad (4) \]

This distance is always non-negative, and in general not symmetric. We may symmetrize it and use
the symmetrized distance $[D(p, q) + D(q, p)]/2$, which will serve as the resemblance coefficient or distance for clustering. For
frequency counts, clustering two amino acids just means merging, that is summing up, their counts. A cluster then
corresponds to a mixed population. That is, the cluster center of amino acids i and j is described by

\[ q'_{i\&j,k} = q'_{ik} + q'_{jk}, \qquad P_{i\&j} = P_i + P_j. \qquad (5) \]

With the resemblance coefficient and cluster center defined, routine cluster algorithms, such as the centroid 
method, may be applied. 
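A minimal sketch of such a centroid-type procedure, built from the symmetrized distance of Eq. (4) and the mixed-population center of Eq. (5), might look as follows. Several points are our assumptions rather than statements of the actual implementation: the merged letter is pooled in both rows and columns so that the joint matrix stays symmetric, the conditional vectors are recomputed from the coarse-grained matrix at every step, and the 4-letter matrix is hypothetical toy data.

```python
import math

def kl(p, q):
    """Eq. (4): D(p, q) = sum_i p_i log(p_i/q_i); needs q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def sym_kl(p, q):
    """Symmetrized distance [D(p, q) + D(q, p)] / 2."""
    return 0.5 * (kl(p, q) + kl(q, p))

def merge(q, a, b):
    """Eq. (5): pool letters a < b by summing their rows and columns (assumption)."""
    n = len(q)
    rows = [[q[a][k] + q[b][k] if i == a else q[i][k] for k in range(n)]
            for i in range(n) if i != b]
    return [[row[a] + row[b] if k == a else row[k] for k in range(n) if k != b]
            for row in rows]

def centroid_clustering(q, labels):
    """Repeatedly merge the two clusters with the smallest symmetrized distance."""
    q = [row[:] for row in q]
    labels = list(labels)
    history = []
    while len(labels) > 1:
        P = [sum(row) for row in q]
        cond = [[x / P[i] for x in row] for i, row in enumerate(q)]
        d, a, b = min((sym_kl(cond[i], cond[j]), i, j)
                      for i in range(len(q)) for j in range(i + 1, len(q)))
        history.append((labels[a], labels[b], d))
        labels[a] = labels[a] + labels[b]  # the cluster is the mixed population
        del labels[b]
        q = merge(q, a, b)
    return history

# Hypothetical symmetric joint matrix for four letters (toy data).
toy = [[x / 64 for x in row] for row in
       [[8, 6, 1, 1], [6, 8, 1, 1], [1, 1, 9, 5], [1, 1, 5, 9]]]
history = centroid_clustering(toy, "ABCD")
```

Each entry of `history` records the two groups joined at that step together with their distance, which is the kind of information a table of hierarchical clustering steps lists.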

Henikoff and Henikoff (1992) defined the average mutual information or the average score: 

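\[ H = \sum_{i=1}^{20} \sum_{j=1}^{20} q'_{ij} S_{ij} = \sum_{i=1}^{20} \sum_{j=1}^{20} q'_{ij} \log_2 \frac{q'_{ij}}{P_i P_j}, \qquad (6) \]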

which is again a Kullback-Leibler distance. The difference between H after and before the clustering of i and j
is related to terms like

\[ (q'_{ik} + q'_{jk}) \log \frac{q'_{ik} + q'_{jk}}{(P_i + P_j) P_k} - q'_{ik} \log \frac{q'_{ik}}{P_i P_k} - q'_{jk} \log \frac{q'_{jk}}{P_j P_k}, \qquad (7) \]

which, by introducing $x_i = q'_{ik}/P_i$, $x_j = q'_{jk}/P_j$, $\omega_i = P_i/(P_i + P_j)$, $\omega_j = P_j/(P_i + P_j)$ and
$\langle f(x) \rangle = \omega_i f(x_i) + \omega_j f(x_j)$, is proportional to $f(\langle x \rangle) - \langle f(x) \rangle$ with $f(x) = x \log x$. By Jensen's inequality for
convex functions (x log x here) (Rassias, 2000; Rassias and Srivastava, 1999), $f(\langle x \rangle) \le \langle f(x) \rangle$, so H never increases after each
step of clustering. To keep the average score as close as possible to its value before coarse-graining, we should
maximize H. This average mutual information H can be chosen as the objective function for clustering
with respect to scores. Compared with the above approach based on the conditional probability $p(j|i)$,
this objective function also takes the abundance of amino acids into account. We shall use Ward's method
(Romesburg, 1984) to perform the clustering.
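The role of H as an objective function can be illustrated with a short Python sketch. It is a Ward-style greedy procedure under our own assumptions, not necessarily the implementation used here: at each step every pair of clusters is tried, and the merge that leaves H largest is accepted. The function names and the 4-letter joint matrix are hypothetical, and natural logarithms are used.

```python
import math

def average_score(q):
    """Eq. (6) in natural-log units: H = sum_ij q_ij log[q_ij / (P_i P_j)]."""
    P = [sum(row) for row in q]
    return sum(x * math.log(x / (P[i] * P[j]))
               for i, row in enumerate(q)
               for j, x in enumerate(row) if x > 0)

def merge(q, a, b):
    """Pool letters a < b by summing their rows and columns of the joint matrix."""
    n = len(q)
    rows = [[q[a][k] + q[b][k] if i == a else q[i][k] for k in range(n)]
            for i in range(n) if i != b]
    return [[row[a] + row[b] if k == a else row[k] for k in range(n) if k != b]
            for row in rows]

def ward_like_clustering(q, labels):
    """Greedy merging that maximizes H after each step; returns the H trace."""
    q = [row[:] for row in q]
    labels = list(labels)
    trace = [(len(labels), average_score(q), tuple(labels))]
    while len(labels) > 1:
        n = len(q)
        # Try every pair and keep the merge with the largest remaining H.
        h, a, b = max((average_score(merge(q, i, j)), i, j)
                      for i in range(n) for j in range(i + 1, n))
        labels[a] += labels[b]
        del labels[b]
        q = merge(q, a, b)
        trace.append((len(labels), h, tuple(labels)))
    return trace

# Hypothetical symmetric joint matrix for four letters (toy data).
toy = [[x / 64 for x in row] for row in
       [[8, 6, 1, 1], [6, 8, 1, 1], [1, 1, 9, 5], [1, 1, 5, 9]]]
trace = ward_like_clustering(toy, "ABCD")
```

The trace pairs the number of clusters with the remaining average score. In line with the Jensen argument, H does not increase from one step to the next and vanishes when all letters are pooled into a single cluster.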
Results 

By means of the entropic Kullback-Leibler distance, and defining the center of a cluster by the distribution of
the mixed population, we conduct cluster analysis on the MJ frequency counts with the centroid method. The
result of the hierarchical steps of clustering is shown in Table I. This will be referred to as the MJ-clustering.
We do see Baker's five representative letters (AIGEK) at step 14, which ends at 6 clusters including the
cluster consisting of the extraordinary single member Cys.

Most of our cluster analysis is based on the BLOSUM 62 frequency counts. The counterpart of Table
I for BLOSUM is Table II. Taking the average score H as the objective function for maximization, the
clustering result of Ward's method is given in Table III. These two clusterings will be referred to as the
HH- and BL-clustering, respectively. For the BL-clustering, as the number of clusters becomes smaller, the
average score decreases faster, as shown in Fig. 1. When the total number of clusters is three, the score drops
to about half of its original value.

A clustering result can be represented by a tree. The cophenetic matrix built by tracing distances along
the tree is equivalent to the tree. The correlation between the cophenetic matrix and the resemblance matrix is
often used to measure the quality of clustering. We introduce the ordinal cophenetic matrix by taking the
clustering depth as the distance. For example, Y and Q group together at step 5 in Table I. The YQ element
of the ordinal cophenetic matrix is 5 as shown in Table IV, where the lower and upper matrices correspond
to Tables I and II, respectively. In this way we ignore some numerical details, and focus on the order of
the nodes in the tree. We compare the BL-clustering with the MJ- and HH-clusterings by calculating the
difference between their ordinal cophenetic matrices. As shown in Table V, the two clusterings HH and BL
are closer to each other than MJ and BL are. Large positive and negative values of entries indicate the main
differences. The MJ-clustering prefers F to group with M, and Q with Y, while the BL- or HH-clustering
prefers F to group with Y, and Q with E. In all three clusterings the separation of hydrophobic and
hydrophilic groups is rather clear.
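The construction of the ordinal cophenetic matrix can be made concrete with a small Python sketch. The data layout is our assumption: a clustering is given as an ordered list of merges, each a pair of strings naming the letters of the two groups joined at that step. The two four-letter trees are hypothetical and are not taken from Tables I-III.

```python
def ordinal_cophenetic(letters, merges):
    """Entry (x, y) is the step at which letters x and y first share a cluster.

    merges[s - 1] = (group_a, group_b) holds the two groups joined at step s,
    each written as a string of single-letter codes.
    """
    index = {c: k for k, c in enumerate(letters)}
    n = len(letters)
    C = [[0] * n for _ in range(n)]
    for step, (ga, gb) in enumerate(merges, start=1):
        for x in ga:
            for y in gb:
                C[index[x]][index[y]] = step
                C[index[y]][index[x]] = step
    return C

def difference(C1, C2):
    """Entry-wise difference of two ordinal cophenetic matrices."""
    return [[a - b for a, b in zip(r1, r2)] for r1, r2 in zip(C1, C2)]

# Two hypothetical clusterings of the same four letters (toy data).
tree1 = [("A", "B"), ("C", "D"), ("AB", "CD")]
tree2 = [("A", "C"), ("B", "D"), ("AC", "BD")]
C1 = ordinal_cophenetic("ABCD", tree1)
C2 = ordinal_cophenetic("ABCD", tree2)
D = difference(C1, C2)
```

Here A and B join at step 1 in the first tree but only at step 3 in the second, so the AB entry of the difference is negative, while the AC entry is positive for the opposite reason. Entries of large magnitude thus point to pairs that the two trees treat differently.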
Discussion 

We have also performed cluster analysis based on BLOSUM 50 and BLOSUM 90. The results are very close to those
obtained for BLOSUM 62.

The clustering based on MJ shows an evident discrepancy from that based on BLOSUM. Given the way
the frequency counts are obtained, the BLOSUM data are more relevant to the evolutionary difference of residues,
while the MJ data are more relevant to structural difference. There are many amino acid difference formulas (Grantham, 1974).
From composition c (defined as the atomic weight ratio of noncarbon elements in end groups to carbons in
the side chain), polarity p and volume v, Grantham (1974) derived an amino acid difference matrix, which
exhibits stronger correlation with evolution than the method of minimum base changes between codons.
This difference matrix is also a good candidate for the resemblance matrix for clustering. To disturb the
data as little as possible, we perform the UPGMA (unweighted pair-group method using arithmetic averages) clustering (referred
to as GR) on the data. The difference in the ordinal cophenetic matrices is shown in Table VI. Compared
with MJ, BL is closer to GR, which is derived from physicochemical properties of amino acids. The average absolute
differences of the 190 entries are 1.84 and 2.84 for BL − GR and HH − GR, respectively. Since different structure
regularities prefer certain residues, residue clustering should not be identical in all structure subclasses.
Structure-subclass-specific clustering would give us more insight.
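For readers unfamiliar with UPGMA, the following Python sketch shows the idea on a dissimilarity matrix, together with the mean absolute value over the upper triangle of a difference matrix (190 entries for 20 letters). Both functions, the 4-letter dissimilarities and the 3-letter difference matrix are hypothetical illustrations, not Grantham's values or the entries of Table VI.

```python
def upgma(dist, labels):
    """UPGMA: join the closest pair of clusters, where the distance between
    two clusters is the unweighted mean over all pairs of their members."""
    clusters = [[k] for k in range(len(labels))]
    merges = []
    while len(clusters) > 1:
        d, i, j = min(
            (sum(dist[x][y] for x in ca for y in cb) / (len(ca) * len(cb)), i, j)
            for i, ca in enumerate(clusters)
            for j, cb in enumerate(clusters) if i < j
        )
        merges.append(("".join(labels[k] for k in clusters[i]),
                       "".join(labels[k] for k in clusters[j]), d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges

def mean_abs_upper(D):
    """Average absolute value over the n(n-1)/2 entries above the diagonal."""
    n = len(D)
    vals = [abs(D[i][j]) for i in range(n) for j in range(i + 1, n)]
    return sum(vals) / len(vals)

# Hypothetical dissimilarities between four letters (toy data).
toy_dist = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]
merges = upgma(toy_dist, "ABCD")

# Hypothetical difference of two ordinal cophenetic matrices for three letters.
toy_diff = [[0, -2, 2], [-2, 0, 1], [2, 1, 0]]
summary = mean_abs_upper(toy_diff)
```

Because the cluster distance is recomputed from the original dissimilarities at every step, no entry of the input matrix is rescaled or reweighted, which is one way to read the wish to disturb the data as little as possible.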

For the BLOSUM data, the direct use of pair frequency counts provides us with a consistent way to derive
coarse-grained scores from the mixed population. We think that for the MJ data the natural objective function should
be the average contact energy, which is the counterpart of the BLOSUM H obtained by replacing the logarithm of odds
$S_{ij}$ with the contact energy $C_{ij}$. The coarse-grained contact energy can be deduced from the mixed population.
This will furnish us with a consistent way for cluster analysis.

This work was finished during the author's visit to the Bartol Research Institute, University 
of Delaware. The author thanks Dr. S.T. Chui for the warm hospitality and fruitful discussions. 

This work was supported in part by the Special Funds for Major National Basic Research Projects, 
the National Natural Science Foundation of China and Research Project 248 of Beijing. 
Table I. Clustering based on the MJ pair frequency counts with the centroid method. The first column
indicates the step in the hierarchical clustering.

Table II. Clustering based on the BLOSUM62 frequency counts with the centroid method.

Table III. Clustering based on the BLOSUM62 frequency counts with Ward's method.

Table IV. Ordinal cophenetic matrices of the MJ-clustering (lower) and HH-clustering (upper).

Table V. Difference of the BL ordinal cophenetic matrix from those of the MJ- (lower) and HH-clustering
(upper).

Table VI. Difference of the MJ- (lower) and BL-clustering (upper) ordinal cophenetic matrices from that
of the GR-clustering.

Fig. 1. Relationship between the average score and the number of clusters. (Here the score is in the natural
logarithm instead of base 2.)